Structural Equation Modeling
I am convinced that SEM is a fundamental tool for research in psychology, and most, if not all, researchers in this area should know it. Indeed, it is key to many aspects of your research:
I am not a statistician. This will limit the statistical depth you reach by the end of the course, but it hopefully means more practical, psychology-based examples and experiences.
Basic R knowledge (e.g., `for` loops) is assumed.
The material is divided into topics.
For each topic you will find:
Slides
Additional code
Data
We will probably do live coding when needed. I will work on this file: LiveCode
I also prepared this file where we can collect questions, in case it is too early to answer one or you want to save it for later: Q doc
In general, live materials are in this folder
Slides and materials are in the Moodle page of the course
OR moodle psicologia unipd > Formazione Post Lauream > Corsi di Dottorato > Psychological Sciences aa 2023/2024 > Structural Equations
Logbook: please fill in the logbook every day.
SEM is a multivariate statistical modeling technique
SEM allows us to test a hypothesis/model about the data
What is so special about SEM?
SEM works with matrices
\(\boldsymbol{S}\) observed var-cov
\(\boldsymbol{\Sigma}\) true var-cov
\(\boldsymbol{\hat{\Sigma}}\), also written \(\boldsymbol{\Sigma}(\theta)\), model-implied var-cov
THE MAIN AIM OF SEM IS TO RECONSTRUCT THE TRUE VARIANCE-COVARIANCE MATRIX
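As a minimal sketch of where \(\boldsymbol{S}\) comes from (in Python/NumPy here, although the course itself uses lavaan in R; the data are simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical data: 200 subjects, 3 observed variables
data = rng.normal(size=(200, 3))

# observed variance-covariance matrix S (variables in columns)
S = np.cov(data, rowvar=False)

print(S.shape)                 # (3, 3)
print(np.allclose(S, S.T))     # True: S is symmetric
print(np.all(np.diag(S) > 0))  # True: the diagonal holds variances
```

SEM estimation then asks whether a model with parameters \(\theta\) can produce a \(\boldsymbol{\Sigma}(\theta)\) close to this \(\boldsymbol{S}\).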
Variables are the way attributes that vary across individuals are operationalized and represented for further data processing. They can be categorized according to many criteria (e.g., dependent/independent…), but in SEM we classify them first as:
Latent variables
hypothetical variables that correspond to more or less abstract concepts
formative or reflective
examples are intelligence, anxiety, executive functions, personality traits…
Observed variables
variables that can be directly observed and measured
examples can be weight, height, gender, income…
In SEM we also have an additional type of classification:
Exogenous variables
Variables whose causes lie outside the model; they will be used only as predictors in the model. They do not receive arrows.
They are indicated with \(x\), if observed, or with \(\xi\), if latent.
Endogenous variables
Variables that are determined by variables within the model (they receive arrows); can be used as predictors or dependent variables in the model.
They are indicated with \(y\), if observed, or with \(\eta\), if latent.
This brings us to deepen the relationships between variables.
The general aim of statistical analysis is to study relationship among variables
On the basis of the relationships among the variables, we distinguish two kinds of models: symmetrical and asymmetrical.
X -> Y
Variables are divided into two sets: dependent or response variables and predictors or explanatory variables
\(X\) is the set of explanatory variables, \(Y\) is the set of response variables, and arrows represent the direction of the hypothesized relationship.
These models imply cause-and-effect relationships.
Example
People who study more obtain higher grades.
\[ X_i \Leftrightarrow Y_j \quad \forall i,j \]
This means that neither variable causes the other, nor can either be considered prior in time to the other; all these relationships are bidirectional.
These models do not imply nor consider causality.
Example
People who have higher grades in math have higher grades in art.
Asymmetrical relationships are usually tested with regressions!
As you remember, regression models can be written, using the classical formulation, as \(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\), and graphically depicted (getting closer to SEM) as a path diagram.
But what if we have in mind a more complex pattern of relationships? What if we have several regression models in mind and need to estimate all of them simultaneously?
What we need is a system of equations.
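For instance, a hypothetical two-equation system (variable names and coefficients are illustrative) could look like:

\[ \begin{cases} y_1 = \gamma_{11} x_1 + \zeta_1 \\ y_2 = \beta_{21} y_1 + \gamma_{21} x_1 + \zeta_2 \end{cases} \]

Note that \(y_1\) appears as a response in the first equation and as a predictor in the second: exactly the situation a single regression cannot handle.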
This system can also be drawn with SEM notation, but it is actually the same… just better!
The covariance matrix is the input for the estimation process. In general, given \(q\) exogenous (\(x\)) and \(p\) endogenous (\(y\)) variables, the covariance matrix will be the \((p + q) \times (p + q)\) matrix

\[ \boldsymbol{S} = \begin{pmatrix} \boldsymbol{S}_{yy} & \boldsymbol{S}_{yx} \\ \boldsymbol{S}_{xy} & \boldsymbol{S}_{xx} \end{pmatrix} \]

in which the diagonal elements are variances and the off-diagonal elements are covariances.
Variables
\(x\) exogenous observed (\(q\))
\(\xi\) exogenous latent (\(n\))
\(y\) endogenous observed (\(p\))
\(\eta\) endogenous latent (\(m\))
Stochastic errors
\(\delta\) measurement errors in \(x\)
\(\epsilon\) measurement errors in \(y\)
\(\zeta\) equation errors in the structural relationship between \(\eta\) and \(\xi\)
Parameter matrices
\(\boldsymbol{\Lambda}\) relationship between latent (\(\xi\) and \(\eta\)) and observed (\(x\) and \(y\)) variables [\((p + q) \times (m + n)\)]
\(\boldsymbol{B}\) relationship between latent variables [\((m + n) \times (m + n)\)]
Covariance matrices
\(Cov\)(\(\zeta\), \(\xi\)) = \(\boldsymbol{\Psi}\) matrix [\((m + n) \times (m + n)\)]
\(Cov\)(\(\epsilon\), \(\delta\)) = \(\boldsymbol{\Theta}\) matrix [\((p + q) \times (p + q)\)]
The SEM model in its most general form consists of two parts
The measurement model
\(x = \boldsymbol{\Lambda}_x\boldsymbol{\xi} + \boldsymbol{\delta}\)
\(y = \boldsymbol{\Lambda}_y\boldsymbol{\eta} + \boldsymbol{\epsilon}\)
The structural model
\(\boldsymbol{\eta} = \boldsymbol{B\eta} + \boldsymbol{\Gamma\xi} + \boldsymbol{\zeta}\)
or, solving for \(\boldsymbol{\eta}\): \(\boldsymbol{\eta} = (\boldsymbol{I} - \boldsymbol{B})^{-1}(\boldsymbol{\Gamma\xi} + \boldsymbol{\zeta})\)
Expected values of latent variables and stochastic errors are 0:
\(E\)(\(\eta\)) = 0
\(E\)(\(\xi\)) = 0
\(E\)(\(\zeta\)) = 0
\(E\)(\(\epsilon\)) = 0
\(E\)(\(\delta\)) = 0
Errors are uncorrelated with latent variables and are mutually uncorrelated: e.g., \(Cov(\boldsymbol{\epsilon}, \boldsymbol{\eta}) = Cov(\boldsymbol{\delta}, \boldsymbol{\xi}) = \boldsymbol{0}\) and \(Cov(\boldsymbol{\epsilon}, \boldsymbol{\delta}) = \boldsymbol{0}\).
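Under these assumptions, the model-implied covariance of the observed \(x\) variables, for example, takes the familiar form (here \(\boldsymbol{\Phi} = Cov(\boldsymbol{\xi})\), i.e., the corresponding block of \(\boldsymbol{\Psi}\), and \(\boldsymbol{\Theta}_\delta\) the \(\delta\) block of \(\boldsymbol{\Theta}\)):

\[ \boldsymbol{\Sigma}_{xx}(\theta) = \boldsymbol{\Lambda}_x \boldsymbol{\Phi} \boldsymbol{\Lambda}_x' + \boldsymbol{\Theta}_\delta \]

This is how the model parameters end up implying a covariance matrix \(\boldsymbol{\Sigma}(\theta)\) that can be compared with \(\boldsymbol{S}\).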
There are 5 principal steps in Structural Equation Modeling:
model specification
model identification
parameter estimation
testing
model modification
As usual, these steps are like a cycle: when you arrive at step 5 you can always come back to step 1.
Aim of the model
What is a model
Examples
Basically, we want to know if there is enough information to identify a solution (aka estimate all the unknown parameters).
A model can be:
Under-identified: there are MORE parameters to be estimated than elements in the covariance matrix (e.g., \(X + Y = 10\))
Just-identified: the number of parameters to be estimated equals the number of elements in the covariance matrix (\(df = 0\))
Over-identified: there are FEWER parameters to be estimated than elements in the covariance matrix (\(df > 0\))
To ensure that the number of unknown parameters (\(t\)) is not greater than the number of nonredundant elements in the covariance matrix of \(q\) observed variables, we can use the following formula:
\[ t \leq \frac{q(q+1)}{2} \]
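A quick sanity check of this rule (a sketch in Python; the one-factor model with four indicators is a hypothetical example, not one from the course):

```python
# nonredundant elements in the covariance matrix of q observed variables
def nonredundant(q):
    return q * (q + 1) // 2

# hypothetical one-factor model with q = 4 indicators:
# 3 free loadings (one fixed to 1 for scaling) + 4 error variances
# + 1 factor variance = 8 free parameters
q, t = 4, 8

print(nonredundant(q))       # 10
print(t <= nonredundant(q))  # True: the t-rule is satisfied
print(nonredundant(q) - t)   # 2: the model is over-identified (df > 0)
```

Note that the \(t\)-rule is necessary but not sufficient: a model can satisfy it and still be unidentified.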
To estimate the model parameters we can use different estimation methods. These aim to find the parameter values for which the model-implied (theoretical) covariance matrix \(\boldsymbol{\hat{\Sigma}}\), a function of the model parameters, is as close as possible to the observed covariance matrix \(\boldsymbol{S}\).
Some of the many estimation methods are:
Maximum Likelihood (ML), default in lavaan
Unweighted Least Squares (ULS)
Generalized Least Squares (GLS)
Diagonally Weighted Least Squares (DWLS), default for ordinal variables in lavaan
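For intuition, the discrepancy that ML minimizes is the standard fit function \(F_{ML} = \ln|\boldsymbol{\Sigma}(\theta)| + tr(\boldsymbol{S}\boldsymbol{\Sigma}(\theta)^{-1}) - \ln|\boldsymbol{S}| - p\). A minimal sketch in Python/NumPy (the matrices below are made up for illustration):

```python
import numpy as np

def f_ml(S, Sigma):
    """ML discrepancy between observed S and model-implied Sigma."""
    p = S.shape[0]
    return (np.log(np.linalg.det(Sigma))
            + np.trace(S @ np.linalg.inv(Sigma))
            - np.log(np.linalg.det(S)) - p)

S = np.array([[1.0, 0.5],
              [0.5, 1.0]])

# a model that reproduces S perfectly has (numerically) zero discrepancy
print(abs(f_ml(S, S)) < 1e-10)  # True

# a model that ignores the covariance fits worse
misfit = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
print(f_ml(S, misfit) > 0)      # True
```

Each estimator differs in how it weights the residual differences between \(\boldsymbol{S}\) and \(\boldsymbol{\hat{\Sigma}}\), which is why the choice depends on the measurement level and distribution of the variables.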
Is the model adequate? Are our parameters able to construct a theoretical matrix (\(\boldsymbol{\hat{\Sigma}}\)) which is close to the original empirical covariance matrix \(\boldsymbol{S}\)?
This is the goal of a good model: to reproduce the original covariance matrix from a set of theoretical associations/effects.
Formally:
\[ H_0 : \boldsymbol{\hat{\Sigma}}(\theta) = \boldsymbol{\Sigma} \] where \(\boldsymbol{\Sigma}\) is the true covariance matrix among model variables, \(\theta\) the parameters vector, and \(\boldsymbol{\hat{\Sigma}}\) the reproduced covariance matrix.
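In practice (a standard result, not specific to this course), this hypothesis is usually tested with the statistic

\[ T = (N-1)\,F_{ML} \sim \chi^2_{df}, \qquad df = \frac{q(q+1)}{2} - t \]

where \(N\) is the sample size and \(F_{ML}\) the minimized discrepancy: under \(H_0\), \(T\) is asymptotically chi-square distributed.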
At this point you are free to modify the model based on the results obtained…AND THE THEORY!
img credits to dr. Johnny Lin
If all that seemed difficult and boring, now comes the fun part: colors, figures, and arrows!
Graphical representation is a key attribute of structural equation modeling:
It helps you understand the model
It helps you think and reason about the model (a priori)
It helps you write and formalize the model
It is easy, but a few rules must be followed to obtain a readable model
Latent variables are circles or ellipses
Manifest/observed variables are square or rectangular boxes
Errors are represented by corresponding letters (or values) only
\[ \delta_1 / \epsilon_1 / \zeta_1 \]
All model relationships are represented by arrows;
NO relationship, NO arrow...
...and usually NO arrow, NO relationship.
Each arrow is a model parameter and has two indices (e.g., \(\beta_{21}\)).
Asymmetrical relationships are represented by a single-headed arrow: the first index indicates the variable the arrow points to, the second index indicates the variable of origin.
Symmetrical relationships are represented by double-headed arrows and two indices, one for each variable.
A summary
All errors have a single headed arrow pointing to a variable; all variables, except \(\xi\), may have an error.
Double-headed arrows associated with errors indicate error variances.
img credits to dr. Johnny Lin
…and much more
THERE IS EVEN A JOURNAL ON SEM
Structural Equation Modeling: A Multidisciplinary Journal